33 research outputs found
Distributed multinomial regression
This article introduces a model-based approach to distributed computing for
multinomial logistic (softmax) regression. We treat counts for each response
category as independent Poisson regressions via plug-in estimates for fixed
effects shared across categories. The work is driven by the
high-dimensional-response multinomial models that are used in analysis of a
large number of random counts. Our motivating applications are in text
analysis, where documents are tokenized and the token counts are modeled as
arising from a multinomial dependent upon document attributes. We estimate such
models for a publicly available data set of reviews from Yelp, with text
regressed onto a large set of explanatory variables (user, business, and rating
information). The fitted models serve as a basis for exploring the connection
between words and variables of interest, for reducing dimension into supervised
factor scores, and for prediction. We argue that the approach herein provides
an attractive option for social scientists and other text analysts who wish to
bring familiar regression tools to bear on text data.Comment: Published at http://dx.doi.org/10.1214/15-AOAS831 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Multinomial Inverse Regression for Text Analysis
Text data, including speeches, stories, and other document forms, are often
connected to sentiment variables that are of interest for research in
marketing, economics, and elsewhere. It is also very high dimensional and
difficult to incorporate into statistical analyses. This article introduces a
straightforward framework of sentiment-preserving dimension reduction for text
data. Multinomial inverse regression is introduced as a general tool for
simplifying predictor sets that can be represented as draws from a multinomial
distribution, and we show that logistic regression of phrase counts onto
document annotations can be used to obtain low dimension document
representations that are rich in sentiment information. To facilitate this
modeling, a novel estimation technique is developed for multinomial logistic
regression with very high-dimension response. In particular, independent
Laplace priors with unknown variance are assigned to each regression
coefficient, and we detail an efficient routine for maximization of the joint
posterior over coefficients and their prior scale. This "gamma-lasso" scheme
yields stable and effective estimation for general high-dimension logistic
regression, and we argue that it will be superior to current methods in many
settings. Guidelines for prior specification are provided, algorithm
convergence is detailed, and estimator properties are outlined from the
perspective of the literature on non-concave likelihood penalization. Related
work on sentiment analysis from statistics, econometrics, and machine learning
is surveyed and connected. Finally, the methods are applied in two detailed
examples and we provide out-of-sample prediction studies to illustrate their
effectiveness.Comment: Published in the Journal of the American Statistical Association 108,
2013, with discussion (rejoinder is here: http://arxiv.org/abs/1304.4200).
Software is available in the textir package for
One-step estimator paths for concave regularization
The statistics literature of the past 15 years has established many favorable
properties for sparse diminishing-bias regularization: techniques which can
roughly be understood as providing estimation under penalty functions spanning
the range of concavity between and norms. However, lasso
-regularized estimation remains the standard tool for industrial `Big
Data' applications because of its minimal computational cost and the presence
of easy-to-apply rules for penalty selection. In response, this article
proposes a simple new algorithm framework that requires no more computation
than a lasso path: the path of one-step estimators (POSE) does penalized
regression estimation on a grid of decreasing penalties, but adapts
coefficient-specific weights to decrease as a function of the coefficient
estimated in the previous path step. This provides sparse diminishing-bias
regularization at no extra cost over the fastest lasso algorithms. Moreover,
our `gamma lasso' implementation of POSE is accompanied by a reliable heuristic
for the fit degrees of freedom, so that standard information criteria can be
applied in penalty selection. We also provide novel results on the distance
between weighted- and penalized predictors; this allows us to build
intuition about POSE and other diminishing-bias regularization schemes. The
methods and results are illustrated in extensive simulations and in application
of logistic regression to evaluating the performance of hockey players.Comment: Data and code are in the gamlr package for R. Supplemental appendix
is at https://github.com/TaddyLab/pose/raw/master/paper/supplemental.pd
Estimation and Inference about Heterogeneous Treatment Effects in High-Dimensional Dynamic Panels
This paper provides estimation and inference methods for a large number of
heterogeneous treatment effects in a panel data setting with many potential
controls. We assume that heterogeneous treatment is the result of a
low-dimensional base treatment interacting with many heterogeneity-relevant
controls, but only a small number of these interactions have a non-zero
heterogeneous effect relative to the average. The method has two stages. First,
we use modern machine learning techniques to estimate the expectation functions
of the outcome and base treatment given controls and take the residuals of each
variable. Second, we estimate the treatment effect by l1-regularized regression
(i.e., Lasso) of the outcome residuals on the base treatment residuals
interacted with the controls. We debias this estimator to conduct pointwise
inference about a single coefficient of treatment effect vector and
simultaneous inference about the whole vector. To account for the unobserved
unit effects inherent in panel data, we use an extension of correlated random
effects approach of Mundlak (1978) and Chamberlain (1982) to a high-dimensional
setting. As an empirical application, we estimate a large number of
heterogeneous demand elasticities based on a novel dataset from a major
European food distributor